Symbolic Regression

Modeling Molecular Mass

Delanyce Rose & Richard Henry

2025-08-01

Outline

  • Symbolic Regression
  • Molecular Mass
  • Typical Workflow
  • Toy Dataset
  • Software Implementation
  • Noisy Toy
  • Belly of the Beast
  • Quiet Toy
  • Alternatives to Evolution
  • Data Exploration
  • Modeling and Results
  • Conclusions

Methods

Definition

  • independent variable \(X\)
  • dependent variable \(y\)
  • analytical expression \(f(X)\)
  • \[\hat{y}_{23} \approx f(X_{23})\]

Symbolic Regression Alternatives

  • neural networks
  • \[\left.\frac{dy}{dX}\right|_{X_{23}}\approx \frac{\hat{y}_{24}-\hat{y}_{23}}{X_{24}-X_{23}}\]
  • linear regression
  • \[\hat{y}_{23} \approx \beta_0 +\beta_1 \cdot X\]
  • best form?
  • \[\hat{y}_{23} \approx \beta_2 +\beta_3 \cdot e^X\]
  • \[\hat{y}_{23} \approx \beta_4 +\beta_5 \cdot \log(X)\]
  • \[\log(\hat{y}_{23}) \approx \beta_6 +\beta_7 \cdot \log(X)\]
  • \[\hat{y}\approx\frac{\beta_0+\beta_1x+\beta_2x^2}{\beta_3+\beta_4x}\]
  • Alternating Conditional Expectations
  • Kolmogorov-Arnold Network

Typical Workflow

  • what
  • how
  • when
  • initial population
  • fitness evaluation
  • stop ??
  • new population
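The loop sketched by these bullets can be written out in miniature. This is a generic sketch, not PySR's implementation; the fitness, mutate, and crossover helpers here are toy stand-ins:

```python
import random

random.seed(0)

def evolve(init_population, fitness, mutate, crossover, n_generations=100):
    """Generic evolutionary loop: evaluate, select, vary, repeat."""
    population = list(init_population)
    for _ in range(n_generations):
        # fitness evaluation (lower = better here)
        ranked = sorted(population, key=fitness)
        best = ranked[: len(ranked) // 2]
        # new population: keep the best half, refill with mutated crossovers
        children = [mutate(crossover(random.choice(best), random.choice(best)))
                    for _ in range(len(population) - len(best))]
        population = best + children
    return min(population, key=fitness)

# Toy use: "individuals" are numbers, fitness is distance to the target 10.
found = evolve(
    init_population=[random.uniform(-50, 50) for _ in range(30)],
    fitness=lambda x: abs(x - 10),
    mutate=lambda x: x + random.gauss(0, 0.5),
    crossover=lambda a, b: (a + b) / 2,
    n_generations=200,
)
print(found)
```

In real symbolic regression the individuals are expression trees rather than numbers, but the evaluate–select–vary skeleton is the same; the "stop ??" decision here is simply a fixed generation budget.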

Toy Dataset

\[\gamma_{API}=\frac{141.5}{\gamma_{o}}-131.5\]

Symbol            Meaning           Units
\(\gamma_{API}\)  API Gravity       [degAPI]
\(\gamma_o\)      specific gravity  [1/wtr]
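A minimal sketch of how such a toy dataset might be generated, assuming a specific-gravity range of 0.60 to 1.05 (the range is illustrative; the relationship itself is the exact equation above):

```python
import numpy as np

# 21 specific gravities spanning light condensate to heavy oil (range assumed)
x = np.linspace(0.60, 1.05, 21)     # specific gravity, gamma_o
y = 141.5 / x - 131.5               # API gravity, gamma_API

print(y[0], y[-1])
```

Water (\(\gamma_o = 1\)) maps to exactly 10 degAPI, which makes a handy sanity check on any recovered equation.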

Software Implementation

from pysr import PySRRegressor

myMod = PySRRegressor()                    # default settings

myMod.fit(x.reshape(-1, 1), y)             # X must be 2-D: (n_samples, n_features)

y_pred = myMod.predict(x.reshape(-1, 1))

Software Implementation

myEq = myMod.sympy()    # best equation as a SymPy expression

\[ x_0-(x_0+0.013196754)+1.0131962+ \frac{x_0 (-132.5)- -141.5}{x_0} \]

import sympy as sym

sym.simplify(myEq)

\[-131.500000554+\frac{141.5}{x_0}\]

Noisy Toy

import numpy as np

w = (np.random.rand(21) - 0.5) * 2.5    # uniform noise in [-1.25, 1.25)

\[-132.05688+\frac{141.88547}{x_0}\]

Belly of the Beast (Encoding)

binary_operators=["+","-","*","/"]

unary_operators=None

maxsize=30

A candidate equation is encoded as a token sequence (here, the prefix encoding of \(141.5/\gamma_o - 131.5\)):

Position  1      2       3      4                 5
Token     minus  divide  141.5  specific gravity  131.5

maxdepth=None

Belly of the Beast (Fitness)

elementwise_loss="L2DistLoss()"

L2 (mean squared error):

\[ \frac{1}{n} \sum_{k=1}^{n}{\left(\hat{y}_k-y_k \right)}^2\]

L1 (mean absolute error):

\[ \frac{1}{n} \sum_{k=1}^{n}{\left|\hat{y}_k-y_k \right|}\]
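The two candidate losses, computed directly in NumPy (the first is what `L2DistLoss()` evaluates elementwise before averaging; the second is its L1 counterpart):

```python
import numpy as np

def l2_loss(y_hat, y):
    """Mean squared error -- the average of PySR's elementwise L2DistLoss()."""
    return np.mean((y_hat - y) ** 2)

def l1_loss(y_hat, y):
    """Mean absolute error -- the L1 alternative."""
    return np.mean(np.abs(y_hat - y))

y     = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.5])
print(l2_loss(y_hat, y), l1_loss(y_hat, y))
```

L2 punishes large residuals more heavily than L1, which matters for a dataset whose molecular masses span 70 to 1685.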

parsimony=0.0

model_selection="best"

should_simplify=True

tournament_selection_n=15

Belly of the Beast

Stopping

niterations=100

max_evals=None

timeout_in_seconds=None

Population

populations=31

population_size=27
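Collected in one place, the settings named on these slides map onto the PySRRegressor constructor roughly as follows (spellings follow PySR's documented keyword arguments, which are lowercase):

```python
from pysr import PySRRegressor

myMod = PySRRegressor(
    # encoding
    binary_operators=["+", "-", "*", "/"],
    unary_operators=None,                # trees of binary ops and leaves only
    maxsize=30,                          # complexity cap per equation
    maxdepth=None,
    # fitness
    elementwise_loss="L2DistLoss()",     # mean squared error
    parsimony=0.0,
    model_selection="best",
    should_simplify=True,
    tournament_selection_n=15,
    # stopping
    niterations=100,
    max_evals=None,
    timeout_in_seconds=None,
    # population
    populations=31,
    population_size=27,
)
```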

Quiet Toy

\[ \frac{a_{00}}{x}+a_{01}\]

w <- seq(from = 121.5, to = 161.5, by = 10)

<expr> ::= <op>(<expr>, <expr>) | <var> | <con>
<op>   ::= "+" | "-" | "*" | "/"
<var>  ::= x
<con>  ::= 121.5 | 131.5 | 141.5 | 151.5 | 161.5
Grammatical Evolution Search Results:
  No. Generations:  1531 
  Best Expression:  141.5/x - 131.5 
  Best Cost:        0 
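For comparison, the same grammar can be sampled in Python. This is a random-search sketch, not true grammatical evolution (which evolves integer genomes that index grammar rules), but it searches the same expression space:

```python
import random

random.seed(42)

OPS  = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
        "*": lambda a, b: a * b, "/": lambda a, b: a / b}
CONS = [121.5, 131.5, 141.5, 151.5, 161.5]

def random_expr(depth=0):
    """Sample an expression tree from the slide's grammar (depth-capped)."""
    if depth >= 2 or random.random() < 0.3:
        return "x" if random.random() < 0.5 else random.choice(CONS)
    op = random.choice(list(OPS))
    return (op, random_expr(depth + 1), random_expr(depth + 1))

def evaluate(expr, x):
    if expr == "x":
        return x
    if isinstance(expr, float):
        return expr
    op, a, b = expr
    return OPS[op](evaluate(a, x), evaluate(b, x))

# Target: gamma_API = 141.5/gamma_o - 131.5 on a small specific-gravity grid
xs = [0.7, 0.8, 0.9, 1.0, 1.1]
ys = [141.5 / x - 131.5 for x in xs]

def cost(expr):
    try:
        return sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys))
    except ZeroDivisionError:
        return float("inf")

best, best_cost = None, float("inf")
for _ in range(500_000):
    candidate = random_expr()
    c = cost(candidate)
    if c < best_cost:
        best, best_cost = candidate, c
    if best_cost < 1e-9:     # exact recovery: stop early
        break
print(best, best_cost)
```

Because the constant pool contains both 141.5 and 131.5, even blind sampling of this tiny grammar recovers the target equation with cost zero; the grammar's guardrails do most of the work.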

Alternatives to Evolution

  • Deterministic
  • Information Technology
  • Neural Networks

Alternatives to Evolution

  • Deterministic
    • Brute Force
    • Mathematical Programming
    • Sparse Regression
  • Information Technology
    • data-centric
    • expert-centric
  • Neural Networks
    • Transformers
    • Equation-Builders

Analysis and Results

Data Exploration / Visualization

Variable  Description         Designation
\(Mw\)    Molecular Mass      dependent
\(SG\)    Specific Gravity    independent
\(TBP\)   True Boiling Point  independent
       SG              TBP               MW        
 Min.   :0.6310   Min.   : 306.0   Min.   :  70.0  
 1st Qu.:0.7439   1st Qu.: 373.2   1st Qu.:  99.0  
 Median :0.8474   Median : 584.0   Median : 222.5  
 Mean   :0.8470   Mean   : 575.2   Mean   : 304.6  
 3rd Qu.:0.8784   3rd Qu.: 668.5   3rd Qu.: 297.8  
 Max.   :1.3793   Max.   :1012.0   Max.   :1685.0  
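The summary above comes from R's `summary()`; a pandas equivalent is `DataFrame.describe()`. The rows below reuse the five-number-summary values from the table purely as stand-in data, not the Goossens dataset itself:

```python
import pandas as pd

# Stand-in rows built from the five-number summaries quoted above;
# they are NOT the actual Goossens dataset.
df = pd.DataFrame({
    "SG":  [0.631, 0.7439, 0.8474, 0.8784, 1.3793],
    "TBP": [306.0, 373.2, 584.0, 668.5, 1012.0],
    "MW":  [70.0, 99.0, 222.5, 297.8, 1685.0],
})
print(df.describe())    # count, mean, std, min, quartiles, max per column
```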

Univariate Analysis

  • Raw Molecular Mass Histogram
  • Raw Molecular Mass Box-and-Whiskers Plot
  • Raw Boiling Point Box-and-Whiskers Plot
  • Raw Boiling Point Histogram
  • Raw Specific Gravity Histogram
  • Raw Specific Gravity Box-and-Whiskers Plot

Bivariate Analysis

  • Molecular Mass vs Specific Gravity, Goossens Data
  • Molecular Mass vs Boiling Point, Goossens Data
  • Specific Gravity vs Boiling Point, Goossens Data

Test vs. Train

  • Molecular Mass vs Boiling Point, Both Datasets
  • Molecular Mass vs Specific Gravity, Both Datasets

Existing Correlations

Symbol                   Meaning
\(M_w\)                  Apparent Molecular Mass
\(T_b\)                  True Boiling Point Temperature
\(\gamma_o\)             Specific Gravity
\(a_{00} \dots a_{11}\)  Empirical Constants
\(K_w\)                  Characterization Factor (intermediate value)
\(X_0 \dots X_2\)        Intermediate Variables

Existing Correlations

Hariu & Sage (1969)

\[ M_w = a_{00} + a_{01} K_w + a_{02} K_w^2 + a_{03} T_b K_w + a_{04} T_b K_w^2 + a_{05} T_b^2 K_w + a_{06} T_b^2 K_w^2 \]

\[K_w = \frac{\sqrt[3]{T_b}}{\gamma_o}\]
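The characterization factor is cheap to compute directly (units assumed: \(T_b\) in degrees Rankine, as is conventional for the Watson factor):

```python
# Watson/UOP characterization factor from the equation above
# (units assumed: T_b in degrees Rankine)
def watson_kw(t_b, gamma_o):
    return t_b ** (1.0 / 3.0) / gamma_o

print(watson_kw(1000.0, 0.85))
```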

Kesler & Lee (1976)

\[M_w = X_0 + \frac{X_1}{T_b} + \frac{X_2}{T_b^2}\]

\[X_0 = a_{00} + a_{01} \gamma_o + \left (a_{02} + a_{03} \gamma_o \right ) T_b\]

\[ X_1 = \left (1 + a_{04} \gamma_o + a_{05} \gamma_o^2 \right ) \left (a_{06} + \frac{a_{07}}{T_b} \right ) \cdot 10^7 \]

\[ X_2 = \left (1 + a_{08} \gamma_o + a_{09} \gamma_o^2 \right ) \left (a_{10} + \frac{a_{11}}{T_b} \right ) \cdot 10^{12} \]

American Petroleum Institute (1977)

\[ M_w = a_{00} e^{a_{01} T_b} e^{a_{02} \gamma_o} T_b^{a_{03}} \gamma_o^{a_{04}} \]

Winn, Sim & Daubert (1980)

\[M_w = a_{00} T_b^{a_{01}} \gamma_o^{a_{02}}\]

Riazi & Daubert (1980)

\[M_w = a_{00} T_b^{a_{01}} \gamma_o^{a_{02}}\]

Rao & Bardon (1985)

\[\ln M_w = \left(a_{00} + a_{01} K_w\right) \ln \left(\frac{T_b}{a_{02} + a_{03} K_w}\right)\]

Riazi & Daubert (1987)

\[ M_w = a_{00} T_b^{a_{01}} \gamma_o^{a_{02}} e^{\left (a_{03} T_b + a_{04} \gamma_o + a_{05} T_b \gamma_o \right )} \]

Goossens (1996)

\[M_w = a_{00} T_b^{X_0}\]

\[ X_0 = \frac{a_{03} + a_{04} \ln \left(\frac{T_b}{a_{05} - T_b}\right)}{a_{01} \gamma_o + a_{02}} \]

Linan (2011)

\[ M_w = a_{00} e^{a_{01} T_b} e^{a_{02} \gamma_o} T_b^{a_{03}} \gamma_o^{a_{04}} \]

Hosseinifar & Shahverdi (2021)

\[M_w = {\left [a_{00} T_b^{a_{01}} {\left (\frac{3+2\gamma_o}{3-\gamma_o} \right )}^{\frac{a_{02}}{2}} + a_{03} T_b^{a_{04}} {\left (\frac{3+2\gamma_o}{3-\gamma_o} \right )}^{\frac{a_{05}}{2}} \right ]}^{a_{06}}\]

Stratiev (2023)

\[ M_w = a_{00} + a_{01} e^{\left [a_{02} e^{\left (a_{03} \frac{T_b^{a_{06}}}{\gamma_o^{a_{05}}} \right )} \right ]} \]

Operators to Consider

Partial List

Operator  Type    Description
pow       binary  one expression raised to the power of another
log       unary   logarithm of an expression
exp       unary   antilogarithm of an expression
sqr       unary   expression squared
cub       unary   expression cubed
inv       unary   inverse of an expression
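A sketch of how the operators in this table might be declared in PySR (the built-in names `square` and `cube` stand in for sqr and cub; `inv` is written as a custom Julia-syntax operator, with a SymPy mapping so exported equations stay readable):

```python
from pysr import PySRRegressor

model = PySRRegressor(
    binary_operators=["+", "-", "*", "/", "^"],   # "^" is pow in PySR
    unary_operators=[
        "exp",
        "log",
        "square",             # sqr: expression squared
        "cube",               # cub: expression cubed
        "inv(x) = 1/x",       # custom unary operator, defined in Julia syntax
    ],
    extra_sympy_mappings={"inv": lambda x: 1 / x},  # for .sympy() export
)
```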

Modeling and Results

Default Run

\[ M_w=a_{00}+\frac{a_{01}}{\gamma_o-a_{02}}+\frac{T_b\cdot (a_{03}\cdot T_b-a_{04})}{\gamma_o\cdot (a_{05}-\frac{a_{06}}{T_b})} \]

Correlation Coefficients

                   Raw MW    Goossens Equation  This Equation
Raw MW             1.000000  0.999711           0.999847
Goossens Equation  0.999711  1.000000           0.999798
This Equation      0.999847  0.999798           1.000000
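The correlation coefficients reported in these tables are plain Pearson correlations between raw and predicted molecular mass, e.g. via `numpy.corrcoef` (the arrays below are illustrative, not the study's predictions):

```python
import numpy as np

raw_mw = np.array([70.0, 99.0, 222.5, 297.8, 1685.0])   # illustrative values
pred   = 1.01 * raw_mw + 2.0                            # hypothetical predictions

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is Pearson's r
r = np.corrcoef(raw_mw, pred)[0, 1]
print(r)   # an exact affine relationship gives r == 1
```

Note that r measures linear association only; a model can have r near 1 while still carrying a systematic bias, which is why the residual plots below matter.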

Q-Q Plot for the Default Run Residuals

Power Run

\[M_w=a_{00} \cdot a_{01}^{\gamma_o^{-a_{02}} + a_{03} \cdot T_b} + a_{04}\]

Correlation Coefficients

               Raw MW    This Equation
Raw MW         1.000000  0.997281
This Equation  0.997281  1.000000

Power Run

Exponential Run

\[ M_w= - T_b \cdot \left(a_{00} \cdot T_b - a_{01}\right) \left(a_{02} \cdot 10^{-6} \left(a_{03} \cdot \gamma_o - 2 \cdot T_b \right) \left(T_b - a_{04}\right) - 1\right) + a_{05}\]

\[ M_w= a_{00}\cdot T_b + a_{01} \cdot e^{- \gamma_o^{2} + a_{02}\cdot \gamma_o + a_{03} \cdot T_b}\]

Correlation Coefficients

                 Raw MW    1st Equation  2nd Equation
Raw MW           1.000000  0.997705      0.998420
First Equation   0.997705  1.000000      0.999497
Second Equation  0.998420  0.999497      1.000000

Raw Data Scale Change

Aeon Run

\[ M_w=(a_{00}\cdot T_b-a_{01})e^{a_{02}\cdot 10^{-9}T_b^2(a_{03}\cdot \gamma_o +a_{04}\cdot T_b)} \]

\[ M_w=-a_{00}\cdot \gamma_o+\frac{e^{-\gamma_o^2+{\left(\log T_b\right)}^3}}{T_b^{a_{01}}} +a_{02}\cdot T_b \]

Correlation Coefficients

                 Raw Mw    First Equation  Second Equation
Raw Mw           1.000000  0.998880        0.999324
First Equation   0.998880  1.000000        0.998851
Second Equation  0.999324  0.998851        1.000000

Box-Cox Run

\(\lambda=-0.3624\)

transformed molecular mass histogram

Log vs Box-Cox Transform

Box-Cox vs. Inv. Sqr. Root

transformed Residuals, Box-Cox Run

Shapiro-Wilk p-value \(0.0507\)

Untransformed predicted vs. actual

\[ M_w^t = \left (\gamma_o \left(a_{00} T_b - a_{01} \right) \log(\gamma_o) + a_{02} \right) \log(T_b) \]

Reverse the Box-Cox transform:

\[ M_w = {\left(\lambda M_w^t +1 \right)}^{\frac{1}{\lambda}} \]
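The transform and its reversal can be sketched with `scipy.stats.boxcox` (illustrative masses; the slide's \(\lambda=-0.3624\) was fit on the full dataset, whereas here SciPy refits \(\lambda\) by maximum likelihood on the stand-in values):

```python
import numpy as np
from scipy import stats

mw = np.array([70.0, 99.0, 222.5, 297.8, 1685.0])   # illustrative masses

# Fit lambda by maximum likelihood and transform
mw_t, lam = stats.boxcox(mw)

# Reverse the transform with the equation above
mw_back = (lam * mw_t + 1.0) ** (1.0 / lam)

# Shapiro-Wilk normality check on the transformed values
stat, p = stats.shapiro(mw_t)
print(lam, p)
```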

Correlation Coefficients (Transform Reversed)

               Raw Mw    This Equation
Raw Mw         1.000000  0.995897
This Equation  0.995897  1.000000

Residuals outside of Box-Cox Space

First Sparse Run

\[ M_w=-a_{00} \cdot \gamma_o +\frac{a_{01}\cdot \gamma_o}{T_b}+a_{02}\cdot T_b -a_{03}\]

Correlation Coefficients

               Raw Mw    This Equation
Raw Mw         1.000000  0.970451
This Equation  0.970451  1.000000

First Sparse Run

Second Sparse Run

\[ M_w=-a_{00}\cdot \gamma_o\cdot T_b +a_{01}\cdot T^2_b +a_{02}\cdot e^{\gamma_o} +a_{03} \qquad(1)\]

Correlation Coefficients

                  Raw Mw
Raw Mw            1.000000
1st Run Equation  0.970451
2nd Run Equation  0.985800

Second Sparse Run

Validation Runs

Table 1: Validation Run Correlation Coefficients

Run                   Correlation Coefficient
Raw Mw                1.000000
Default               0.999957
Goossens Correlation  0.999939
Power                 0.996964
Exponential #2        0.998954
Aeon #1               0.999973
Aeon #2               0.999921
Box-Cox               0.999696
Sparse #1             0.992303
Sparse #2             0.996331

All of these are very good, with the validation coefficients exceeding the training coefficients. We will resist the temptation to declare victory over overfitting, as the Hosseinifar dataset is far better behaved than the Goossens dataset.

A better approach with the evolutionary model is probably to start increasing parsimony or decreasing maxsize or maxdepth until the correlation coefficients start to deteriorate.

A different approach is warranted with the deterministic models. It is possible that these are underfit, meaning that we can expand the number of unique terms the algorithm initially considers, or the number of terms we want to keep.

General Observations

The “UOP Characterization Factor” \(K_w\), defined with the Hariu & Sage correlation above, which (at least) two researchers decided was an important correlating variable for molecular mass, was not discovered by our symbolic regression model.

In fairness, however, we dropped the division operator early in our experimentation, and never asked the model to consider a “cube root” unary operator.

Replication is very challenging with these symbolic regression libraries that use evolutionary algorithms. Re-running the same dataset on the same machine with the same libraries produces different equations. In many Scikit-Learn algorithms, answer variation can be minimized by seeding the random number generator the same way each time, but in PySR we also have to turn off the optimizations that allow Julia to run the calculations massively in parallel, which contributes to the criticism that these algorithms are slow.

The deterministic symbolic regression libraries don’t need random number seeds, and the answers stay the same between runs. The equations are definitely more explainable, but appear to be less accurate. Does this accuracy matter? Is it real? The answers to these questions depend on how good we think our measurements are!

In the engineering world, an old trick for speeding up calculations is to use look-up tables instead of functions. The idea is that even though we may know exactly what the governing equations are, producing look-up tables from these functions ahead of time and then applying Gaussian interpolation in real time may be faster. This is very similar to the idea of pre-training a transformer network to do symbolic regression.
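A minimal sketch of the trick, with linear interpolation standing in for the Gaussian interpolation mentioned above:

```python
import numpy as np

# Stand-in for an expensive governing equation
def f(x):
    return 141.5 / x - 131.5

# Pre-compute the table once...
grid = np.linspace(0.6, 1.2, 64)
table = f(grid)

# ...then interpolate at run time (linear here; a Gaussian kernel is smoother)
x = np.array([0.75, 0.95, 1.10])
approx = np.interp(x, grid, table)
print(np.max(np.abs(approx - f(x))))   # small on a 64-point grid
```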

This paper has shown that high-fidelity models can be constructed from “cheap” components (i.e. arithmetic operators). Maybe explainability is overrated, and functions carefully built using symbolic regression may be competitive with old-man Gauss!

If this turns out to be true, then symbolic regression will come full circle and return to its roots.

Conclusions

In summary, we have explored Symbolic Regression using evolutionary and deterministic algorithms on our molecular mass dataset. We have not spent much time on different kinds of engines (e.g. neural networks, mathematical programming) or tested the extrapolation chops of this technology.

Our main conclusions are:

  1. Symbolic Regression can generate reasonable predictions when trained on datasets too small for neural networks.

  2. Evolutionary Symbolic Regression will typically generate novel equations from the human point of view, but these equations can be as accurate as the ones created by humans.

  3. The “explainability” of the generated equations is probably overrated, unless enough guardrails are put up to constrain the space searched by the evolutionary Symbolic Regression algorithm.

  4. The ability of Evolutionary Symbolic Regression to easily find alternative equations of about the same accuracy means that humans can repeat the workflow until an equation that is more “explainable” than the others is generated.

  5. The structure of an equation generated with symbolic regression will withstand some noise.

  6. Piping the “winning” equation through a symbolic mathematics package to simplify and consolidate it is a good idea. The effect on the deterministic equations is very slight.

  7. This consolidation often introduces relationships between variables and constants that were not specified before the algorithm was started. This can nudge the human to reconsider relationships previously rejected.

  8. Not all Symbolic Regression libraries using evolutionary algorithms are created equal. Library comparison is beyond the scope of this report, but we can comment that PySR was one of the better ones.

  9. Running both evolutionary and deterministic models on the same problem is also a good idea.

References